Fast Phonetic Similarity Search over Large Repositories
Authors
Abstract
Information systems in many domains now produce large amounts of unstructured data, and these sources may be analyzed for different purposes. Existing approaches use string similarity methods, supported by a dictionary, to search for valid words within a text. However, they have two main drawbacks. First, they are not rich enough to encode phonetic information to assist the search. Second, they may be inefficient in the presence of spelling errors. In this paper, we present a novel approach for efficiently performing phonetic similarity search over large data sources. We present a data structure called PhoneticMap, which encodes language-specific phonetic information. The phonetic maps are used by a novel fast similarity search algorithm to find words with spelling errors. We validate our approach through an experiment that automatically corrects misspelled words in a data set built from a Portuguese variant of a well-known repository.
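The general idea of a phonetic lookup can be sketched as follows. This is a minimal illustration, not the paper's PhoneticMap or its search algorithm: the rule list `RULES`, the helper names (`phonetic_key`, `build_index`, `suggest`), and the sample dictionary are all assumptions made for the example, and the rules are a crude simplification of Portuguese spelling. Dictionary words are grouped by a phonetic key, so a misspelled word that sounds the same is found with one hash lookup and then ranked by edit distance.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny rule set: spellings that often sound alike
# are rewritten to one canonical symbol. These are NOT the paper's rules.
RULES = [("ss", "s"), ("ç", "s"), ("ch", "x"), ("ph", "f")]


def phonetic_key(word: str) -> str:
    """Map a word to a crude phonetic key by applying RULES in order."""
    key = word.lower()
    for src, dst in RULES:
        key = key.replace(src, dst)
    # A leading "h" is treated as silent in this toy model.
    if key.startswith("h"):
        key = key[1:]
    return key


def build_index(dictionary):
    """Group dictionary words by phonetic key (a stand-in for a phonetic map)."""
    index = defaultdict(list)
    for word in dictionary:
        index[phonetic_key(word)].append(word)
    return index


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def suggest(word: str, index) -> list:
    """Return dictionary words sharing the phonetic key, closest spelling first."""
    candidates = index.get(phonetic_key(word), [])
    return sorted(candidates, key=lambda c: (edit_distance(word.lower(), c), c))


# Usage with a hypothetical three-word dictionary.
index = build_index(["chave", "hora", "massa"])
print(suggest("xave", index))
print(suggest("ora", index))
```

Grouping by key keeps each lookup independent of dictionary size, at the cost of missing errors that change the sound of the word; a real system would need richer, language-specific rules and a fallback for such cases.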
Similar resources
Effective and efficient similarity search in scientific workflow repositories
Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular for similarity search. Effective similarity search requires both high ...
Large Scale Machine Learning, Jan 18, 2016, Lecture 5: Large-Scale Search: Locality Sensitive Hashing (LSH)
Nowadays, there exist hundreds of millions of images online. These images are either stored in web pages, or databases of companies, such as Facebook, Flickr, etc. It is challenging to quickly find similar images from these huge repositories. This is because: • The repositories are huge. Facebook has around 10 billion images [2]. These images have different resolution, dimension. • Images are v...
m3 - A Behavioral Similarity Metric for Business Processes
With the increasing uptake of business process management, companies maintain large scale process repositories consisting of hundreds or thousands of process models. So far, discovery within these repositories is limited to free text search or folder navigation. In a separate stream of research, similarity measures were introduced to get a better understanding of the relationships between proce...
Metric Trees for Efficient Similarity Search in Large Process Model Repositories
Due to the increasing adoption of business process management and the key role of process models, companies are setting up and maintaining large process model repositories. Repositories containing hundreds or thousands of process models are not uncommon, whereas only simplistic search functionality, such as text based search or folder navigation, is provided, today. On the other hand, advanced ...
Optimal Distance Bounds on Time-Series Data
Most data mining operations include an integral search component at their core. For example, the performance of similarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only for storing the data in a compressed form, but also for facilitati...
Journal title:
Volume, issue:
Pages: -
Publication date: 2014